European Patent Office

Certificate

The attached documents are exact copies of the European patent application
described on the following page, as originally filed.

Patent application No.

02079720.5 



PRIORITY 
DOCUMENT 

SUBMITTED OR TRANSMITTED IN 
COMPLIANCE WITH RULE 17.1(a) OR (b) 



For the President of the European Patent Office

R C van Dijk

EPA/EPO/OEB Form 1014.1 - 02.2000



Application no.: 02079720.5

Date of filing: 12.11.02

Applicant(s):

Koninklijke Philips Electronics N.V.
Groenewoudseweg 1
5621 BA Eindhoven
Netherlands

Title of the invention:
(If no title is shown please refer to the description.)

Fingerprinting multimedia contents

Priority(ies) claimed
State/Date/File no.:

International Patent Classification:

G10L/

Contracting states designated at date of filing:

AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR




PHNL021150BPP 






11.11.2002 



Fingerprinting multimedia contents



FIELD OF THE INVENTION

The invention relates to a method and arrangement for extracting a fingerprint
from a multimedia signal.

BACKGROUND OF THE INVENTION

Fingerprints, in the literature sometimes referred to as hashes or signatures,
are binary sequences extracted from multimedia contents that can be used to identify said
contents. Unlike cryptographic hashes of data files (which change as soon as a single bit of the
data file changes), fingerprints of multimedia contents (audio, images, video) are to a certain
extent invariant to processing such as compression and D/A & A/D conversion. That is
generally achieved by extracting the fingerprint from perceptually essential features of the
contents.

A prior art method of extracting a fingerprint from a multimedia signal is
disclosed in International Patent Application WO 02/065782. The method comprises the
steps of extracting from the multimedia signal a set of robust perceptual features, and
converting the set of features into the fingerprint. For audio signals, the perceptual features
are energies of the audio contents in selected sub-bands. For image signals, the perceptual
features are average luminances of blocks into which the image is divided. The conversion
into a binary sequence is performed by thresholding, for example, by comparing each feature
sample with its neighbors.

An attractive application of fingerprinting is content identification. The artist
and title of a song or video clip can be identified by extracting a fingerprint from an
excerpt of the unknown material and sending it to a large database of fingerprints in which
said information is stored.

Experiments have shown that the prior art method of extracting fingerprints
from an audio signal is very robust against almost all commonly used audio processing
operations, such as MP3 compression and decompression, equalization, re-sampling, noise
addition, and D/A & A/D conversion.

It is quite common for radio stations to speed up audio by a few percent. They
supposedly do this for two reasons. Firstly, the duration of songs is then shorter, which
enables them to broadcast more commercials. Secondly, the beat of the song is faster and
the audience seems to prefer this. The speed changes typically lie between zero and four
percent.

Speed changes of audio material cause misalignment in both the temporal and
the frequency domain. The prior art fingerprint extraction method does not suffer from
misalignment in the temporal domain, because the fingerprint is a concatenation of small
sub-fingerprints being extracted from overlapping audio frames. A speed change of, say, 2%
merely causes the 250th sub-fingerprint of an excerpt to be extracted at the position of the
255th sub-fingerprint of the corresponding original excerpt.
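
The index arithmetic of this example can be written out as follows. This is only an illustration under two assumptions that the text does not state explicitly: sub-fingerprints are extracted at a fixed frame hop, and a speed-up by a factor s compresses the time axis by that same factor. The symbols s, n_fast and n_orig are introduced here for convenience.

$$n_{\text{orig}} = s \cdot n_{\text{fast}} = 1.02 \times 250 = 255$$

The sub-fingerprints themselves are therefore still found in the original sequence, only at a slightly different index.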

Misalignment in the frequency domain is caused by spectral energies shifting
to other frequencies. The above example of a 2% speed-up causes all audio frequencies to
increase by 2%. In the prior art audio fingerprint extraction method, this causes the energies
in the selected sub-bands (and thus the fingerprint) to be changed. As a result, the
fingerprints can no longer be found in a database, unless a plurality of fingerprints
corresponding to different speed versions are stored in the database for each song.

Similar considerations apply to image and video material and to other kinds of
perceptual features being used for fingerprint extraction.

OBJECT AND SUMMARY OF THE INVENTION 

It is an object of the invention to provide an improved method and
arrangement for extracting a fingerprint from multimedia contents. It is a particular object of
the invention to provide a method and arrangement for extracting a fingerprint from an audio
signal that is substantially invariant to speed changes of the audio signal.

To this end, the method of extracting a fingerprint from a multimedia signal in
accordance with the invention comprises the steps of: extracting (12,13) from the multimedia
signal a set of robust perceptual features; subjecting (15) the extracted set of features to a
Fourier-Mellin transform; and converting (16,19) the transformed set of features into a
sequence constituting the fingerprint.

The invention exploits the insight that the Fourier-Mellin transform consists of
a log mapping and a Fourier transform. The log mapping converts scaling of the energy
spectrum due to a speed change into a shift. The subsequent Fourier transform converts the
shift into a phase change which is the same for all Fourier coefficients. Magnitudes of the
Fourier coefficients are not affected by the speed change. A fingerprint derived from the
magnitude or from the derivative of the phase of the Fourier coefficients is thus invariant to
speed change.
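
The magnitude invariance stated above can be made explicit with a short derivation. The notation is introduced here for illustration only and does not appear in the text: P(f) is the power spectrum at normal speed, a is the speed factor, u = log f is the log-frequency variable, G(u) = P(e^u) is the log mapped spectrum, and \hat{G}(\xi) is its Fourier transform. Any overall amplitude change caused by the speed change is ignored.

$$P_a(f) = P(f/a) \;\Rightarrow\; G_a(u) = P(e^{u}/a) = G(u - \log a)$$

$$\hat{G}_a(\xi) = \int G(u - \log a)\, e^{-j\xi u}\, du = e^{-j\xi \log a}\, \hat{G}(\xi) \;\Rightarrow\; |\hat{G}_a(\xi)| = |\hat{G}(\xi)|$$

The speed change thus multiplies each Fourier coefficient by a factor of unit modulus, which leaves the magnitudes unchanged.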

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 shows schematically an arrangement for extracting a fingerprint from a
multimedia signal or, equivalently, the corresponding steps of a method of extracting such a
fingerprint according to the invention.

Figs. 2 and 3 show diagrams to illustrate the operation of a log mapping
circuit, which is shown in Fig. 1.

DESCRIPTION OF EMBODIMENTS 

The invention will be described with reference to an arrangement for
extracting a fingerprint from an audio signal. Fig. 1 shows schematically such an
arrangement according to the invention.

The arrangement comprises a framing circuit 11, which divides the audio
signal into overlapping frames of approx. 0.4 seconds with an overlap factor of 31/32. The
overlap is chosen such that a high correlation between sub-fingerprints of subsequent frames
is obtained. Prior to the division into frames, the audio signal has been limited to a frequency
range of approx. 300 Hz-3 kHz and down-sampled (not shown), so that each frame comprises
2048 samples.
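
A minimal sketch of the framing step in plain Python. The hop of 64 samples follows from the stated frame length of 2048 samples and the overlap factor of 31/32; the sampling rate of roughly 5 kHz implied by 2048 samples per 0.4 seconds is an inference, and the function name `split_into_frames` and the dummy signal are illustrative assumptions rather than part of the disclosure.

```python
FRAME_LEN = 2048          # samples per frame, as stated in the text
OVERLAP_FACTOR = 31 / 32  # fraction of a frame shared with the next frame
HOP = round(FRAME_LEN * (1 - OVERLAP_FACTOR))  # 64 new samples per frame


def split_into_frames(samples, frame_len=FRAME_LEN, hop=HOP):
    """Return the overlapping frames that fit entirely inside `samples`."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


# Dummy signal standing in for band-limited, down-sampled audio (hypothetical data)
signal = list(range(FRAME_LEN + 3 * HOP))
frames = split_into_frames(signal)
```

Each new frame advances by only 1/32 of a frame, which is why consecutive sub-fingerprints are highly correlated.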

A Fourier transform circuit 12 computes the spectral representation of every
frame. In the next block 13, the power spectrum of the audio frame is computed, for example,
by squaring the magnitudes of the (complex) Fourier coefficients. For each frame of 2048
audio signal samples, the power spectrum is represented by 1024 samples (positive and
corresponding negative frequencies have the same magnitudes). The samples of the power
spectrum constitute a set of robust perceptual features. The spectrum is not substantially
affected by operations such as D/A & A/D conversion or MP3 compression.
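
The following sketch illustrates blocks 12 and 13 under simple assumptions. It uses a textbook recursive radix-2 FFT written in plain Python so that it runs without dependencies; a practical system would more likely use an optimized library FFT. The helper names `fft` and `power_spectrum` and the test tone are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(frame):
    """Squared magnitudes of the first half of the Fourier coefficients."""
    coeffs = fft(frame)
    half = len(frame) // 2          # 2048 samples -> 1024 power values
    return [abs(c) ** 2 for c in coeffs[:half]]


# Example: a pure tone completing exactly 100 cycles within one 2048-sample frame
N = 2048
frame = [math.cos(2 * math.pi * 100 * i / N) for i in range(N)]
spectrum = power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

Only the first half of the coefficients is kept because, for a real-valued frame, the second half mirrors the first.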

After calculating the power spectrum, an optional normalization circuit 14
applies local normalization to the power spectrum. Such normalization (which includes
de-convolution and filtering) improves the performance as it obtains a more decisive and robust
representation of the power spectrum. Local normalization preserves the important
characteristics of the spectrum and is robust against all kinds of audio processing,
including local modifications of the audio spectrum, such as equalization. The most promising
approach is to emphasize the tonal part of the spectrum by normalizing it with its local mean.
Mathematically, the normalized spectrum N(ω) is obtained by dividing the spectrum A(ω) by
its local mean Lm(ω) as follows:

$$N(\omega) = \frac{A(\omega)}{Lm(\omega)}$$



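The local mean can be calculated in various ways, for example:

$$Lm(\omega) = \frac{1}{2\delta}\int_{\omega-\delta}^{\omega+\delta} A(\tau)\,d\tau \quad \text{(arithmetic mean), or}$$

$$Lm(\omega) = \exp\left(\frac{1}{2\delta}\int_{\omega-\delta}^{\omega+\delta} \log A(\tau)\,d\tau\right) \quad \text{(geometric mean),}$$

and so on.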
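
A discrete sketch of normalization by the arithmetic local mean. It assumes that the half-width δ is expressed as a number of spectrum bins (`half_width`), that the averaging window is simply clipped at the spectrum edges, and that a small `eps` guards against division by zero; none of these choices is specified in the text.

```python
def local_mean(spectrum, half_width):
    """Arithmetic mean over bins i-half_width .. i+half_width, clipped at the edges."""
    n = len(spectrum)
    means = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = spectrum[lo:hi]
        means.append(sum(window) / len(window))
    return means


def normalize(spectrum, half_width=8, eps=1e-12):
    """Divide each spectrum sample by its local mean: N = A / Lm."""
    means = local_mean(spectrum, half_width)
    return [a / (m + eps) for a, m in zip(spectrum, means)]


# A flat spectrum normalizes to (almost exactly) 1 everywhere.
flat = [3.0] * 32
normalized_flat = normalize(flat)

# A single tonal peak stands out clearly after normalization.
tonal = [1.0] * 32
tonal[16] = 50.0
normalized_tonal = normalize(tonal)
```

Because numerator and denominator scale together, a constant gain applied to the whole spectrum cancels out, which is the simplest case of the robustness to equalization described above.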



The normalized spectrum remains invariant to equalization. Moreover, tonal information is
directly related to human hearing and well preserved after most audio processing. The
importance of tonal information is widely accepted and has been utilized in audio recognition
and bit allocation of audio compression. Although local normalization has many advantages,
the normalization is not consistent after compression if there are no tonal components
between ω-δ and ω+δ. To mitigate this effect, integration over time and a total-energy term are
added to Lm(ω). Then a modified local mean Lm'(ω) is given as follows:

$$Lm'(\omega) = \frac{1}{2\delta}\int_{\omega-\delta}^{\omega+\delta}\int_{t-\Delta}^{t} A(\tau)\,d\tau + \alpha\int_{t-\Delta}^{t}\int_{-\infty}^{\infty} A(\tau)\,d\tau$$

where Δ and α are constants, which are determined experimentally. Integration over time
makes the normalization more consistent and the total-energy term limits the increase of
small non-tonal components after normalization.

The invention resides in the application of a Fourier-Mellin transform 15 to
the power spectrum to achieve speed change resilience. The Fourier-Mellin transform
consists of a log mapping process 151 and a Fourier transform (or inverse Fourier transform)
152.

Figs. 2 and 3 show diagrams to illustrate the log mapping operation. In Fig. 2,
reference numeral 21 denotes the samples of the power spectrum of an audio frame as
supplied by the Fourier transform 12 in the case that the audio signal is being played back at
normal speed. For the sake of convenience, a smooth power spectrum in the range
300-3,000 Hz is shown. In reality, the spectrum will generally exhibit a jagged outline.
Reference numeral 22 in Fig. 2 denotes the power spectrum of the same audio frame in the
case that the audio signal is being played back at an increased speed. As can be seen in the
Figure, the speed change causes the power spectrum to be scaled.

Fig. 3 shows the corresponding power spectra as computed by the log mapping
circuit 151. The power spectrum now represents the energy of the audio frame in a selected
number of successive logarithmically spaced sub-bands. Reference numeral 31 denotes the
log mapped power spectrum for the audio signal being played back at normal speed.
Reference numeral 32 denotes the log mapped power spectrum for the audio signal being
played back at the increased speed.

The process of log mapping can be carried out in several ways. In the
embodiment which is shown in Fig. 3, the input power spectrum is interpolated and
re-sampled at logarithmically spaced intervals. In another embodiment (not shown), the samples
within logarithmically spaced (and sized) sub-bands of the input power spectrum are
accumulated to provide respective samples of the log mapped power spectrum.
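
The first variant (interpolation and re-sampling) might look as follows in plain Python. This is a sketch under assumptions: the input samples are taken to be uniformly spaced between `f_lo` and `f_hi`, linear interpolation is used, and the function name `log_map` is hypothetical. The text does not prescribe a particular interpolation method.

```python
def log_map(power, f_lo, f_hi, n_out):
    """Resample `power` (uniformly spaced from f_lo to f_hi) at n_out log-spaced frequencies."""
    n_in = len(power)
    step = (f_hi - f_lo) / (n_in - 1)            # linear spacing of the input samples
    ratio = (f_hi / f_lo) ** (1 / (n_out - 1))   # constant ratio between output frequencies
    out = []
    for i in range(n_out):
        f = f_lo * ratio ** i
        pos = min(max((f - f_lo) / step, 0.0), n_in - 1.0)  # fractional input index
        j = min(int(pos), n_in - 2)
        frac = pos - j
        out.append((1 - frac) * power[j] + frac * power[j + 1])  # linear interpolation
    return out


# Check input: 1024 samples whose value equals their own frequency (300 Hz to 3 kHz),
# so each output sample should equal the log-spaced frequency it was taken at.
linear = [300 + (3000 - 300) * i / 1023 for i in range(1024)]
mapped = log_map(linear, 300.0, 3000.0, 512)
```

With 512 output samples over one decade, neighbouring output frequencies differ by a constant ratio of about 0.45%.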

The number of samples representing the log mapped power spectrum is chosen
to be such that subsequent operations can be carried out with sufficient precision. In a
practical embodiment, the log mapped power spectrum is represented by 512 samples. It will
be appreciated from inspection of Fig. 3 that the log mapping operation translates the scaling
(21→22) of the power spectrum due to the speed change into a shift (31→32). As long as the
playback speed of the audio signal does not change within the frame period (which is a
reasonable assumption in practice), the shift is the same for all coefficients.
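
The translation of scaling into a shift can be checked numerically. The sketch below uses a toy analytic spectrum (an illustrative shape, not measured data) and chooses the speed factor so that the shift is a whole number of log-spaced samples; all names and constants other than the 300 Hz to 3 kHz range and the 512 samples are assumptions.

```python
import math

F_LO = 300.0
N_LOG = 512
RATIO = (3000.0 / F_LO) ** (1 / (N_LOG - 1))  # ratio between neighbouring log-spaced frequencies


def spectrum(f):
    """Toy smooth power spectrum at normal speed, peaking near 900 Hz."""
    return math.exp(-((math.log(f) - math.log(900.0)) ** 2) / 0.1)


def log_sampled(power_fn, n=N_LOG):
    """Sample a spectrum at the log-spaced frequencies F_LO * RATIO**i."""
    return [power_fn(F_LO * RATIO ** i) for i in range(n)]


SHIFT = 4                  # desired shift, in log-spaced samples
speed = RATIO ** SHIFT     # speed factor that produces exactly this shift (about 1.8 %)


def sped_up_spectrum(f):
    """Playing faster by `speed` moves the energy found at f/speed up to f."""
    return spectrum(f / speed)


normal = log_sampled(spectrum)
faster = log_sampled(sped_up_spectrum)
# faster[i + SHIFT] equals normal[i]: the scaled spectrum is a shifted copy on the log axis.
```

A speed change that does not correspond to a whole number of samples gives a fractional shift, which the subsequent Fourier transform handles in the same way.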

The subsequent Fourier transform 152 translates said shift into a change of the
phase of the complex Fourier coefficients. The phase change is the same for all coefficients.
Thus, if the speed of the audio signal changes, the phases of all Fourier coefficients computed
by Fourier transform circuit 152 change by an identical amount. In other words, the
magnitudes of the coefficients as well as their phase differences are invariant to speed
change. They are calculated in a computing circuit 16. As the magnitudes and phase
differences are the same for positive and negative frequencies, the number of unique values is
256.
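
The invariance of the magnitudes can be illustrated with a short vector and a naive discrete Fourier transform. The sketch assumes a circular shift, for which the invariance is exact; a real spectral shift moves content in and out at the band edges, so in practice the invariance is approximate. The helper names and the toy data are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short illustrative vectors)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_shift(x, s):
    """Shift a sequence s positions to the right, wrapping around."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)


n = 64
log_spectrum = [math.exp(-((i - 20) / 5.0) ** 2) for i in range(n)]  # toy log mapped spectrum
shifted = circular_shift(log_spectrum, 3)

mag_a = [abs(c) for c in dft(log_spectrum)]
mag_b = [abs(c) for c in dft(shifted)]
max_magnitude_diff = max(abs(a - b) for a, b in zip(mag_a, mag_b))

# For real input the magnitudes mirror, so only half of them are unique.
unique_magnitudes = mag_a[: n // 2]
```

With 512 log mapped samples instead of 64, the same mirroring leaves the 256 unique values mentioned in the text.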

The vector of 256 magnitudes or phase differences representing the log
mapped power spectrum of an audio frame is hereinafter denoted F(k,n), where k=1..256 and
n is the audio frame number. In fact, the vector constitutes a speed change invariant
fingerprint. However, the number of values is large, and each value requires a multi-bit
representation in a digital fingerprinting system. The number of bits to represent the
fingerprint can be reduced by selecting the lowest order values only. This is performed by a
selection circuit 17. It has been found that the 32 lowest values (the most significant
coefficients) provide a sufficiently accurate representation of the log mapped power
spectrum.

The number of bits can further be reduced by subjecting the selected
magnitudes or phase differences to a thresholding process. In a simple embodiment,
a thresholding stage 19 generates one bit for each feature sample, for example, a '1' if the
value F(k,n) is above a threshold and a '0' if it is below said threshold. Alternatively, a
fingerprint bit is given the value '1' if the corresponding feature sample F(k,n) is larger than
its neighbor, otherwise it is '0'. To this end, the feature samples F(k,n) are first filtered in a
one-dimensional temporal filter 18. The present embodiment uses a further improved version
of the latter alternative. In this preferred embodiment, a fingerprint bit '1' is generated if the
feature sample F(k,n) is larger than its neighbor and if that was also the case in the previous
frame, otherwise the fingerprint bit is '0'. In this embodiment, the filter 18 is a
two-dimensional filter. In mathematical notation:

$$FP(k,n) = \begin{cases} 1 & \text{if } F(k,n)-F(k+1,n)-\bigl(F(k,n-1)-F(k+1,n-1)\bigr) > 0 \\ 0 & \text{if } F(k,n)-F(k+1,n)-\bigl(F(k,n-1)-F(k+1,n-1)\bigr) \le 0 \end{cases}$$

When thresholding is used, each sub-fingerprint being extracted from an audio frame has 32
bits.
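
The thresholding rule in the formula above can be sketched as follows. One assumption is made loudly here: to obtain 32 bits from differences between neighbours k and k+1, the sketch keeps 33 values per frame, which the text does not state. The function names and the sample values are hypothetical.

```python
def fingerprint_bits(curr, prev):
    """FP(k,n) = 1 if F(k,n)-F(k+1,n)-(F(k,n-1)-F(k+1,n-1)) > 0, else 0."""
    bits = []
    for k in range(len(curr) - 1):
        value = (curr[k] - curr[k + 1]) - (prev[k] - prev[k + 1])
        bits.append(1 if value > 0 else 0)
    return bits


def pack_bits(bits):
    """Pack a list of bits into an integer, first bit most significant."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


N_BITS = 32
# Assumption: N_BITS + 1 values are kept so that N_BITS neighbour differences exist.
prev_frame = [float(k % 3) for k in range(N_BITS + 1)]        # stands for F(k, n-1)
curr_frame = [float((2 * k) % 5) for k in range(N_BITS + 1)]  # stands for F(k, n)
sub_fingerprint = pack_bits(fingerprint_bits(curr_frame, prev_frame))
```

Packing the bits into a single 32-bit word is a convenience for storage and comparison; the text only specifies the number of bits.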

Although the invention has been described with reference to audio
fingerprinting, it can also be applied to other multimedia signals such as images and motion
video. While speed changes are often applied to audio signals, affine transformations such as
shift, scaling and rotation are often applied to images and video. The method according to
the invention can be used to improve robustness to such affine transformations. In the case of
a two-dimensional signal, the log mapping process 151 is changed into log-polar mapping to
make it invariant against rotation as well as scaling (retaining aspect ratio). A log-log
mapping makes it invariant to changes of the aspect ratio. The magnitude of the
Fourier-Mellin transform (now a 2D transform) and double differentiation of its phase along the
frequency axis have the desired affine invariant property.
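
For the two-dimensional case, a log-polar mapping might be sketched as below. This is an illustration only: it assumes a square image given as a list of rows, nearest-neighbour sampling about the image centre, and radii growing geometrically from one pixel to the largest radius inside the image. The function name `log_polar` and all parameters are hypothetical.

```python
import math


def log_polar(image, n_radii, n_angles):
    """Nearest-neighbour log-polar resampling of a square image about its centre."""
    size = len(image)
    centre = (size - 1) / 2.0
    r_max = centre                              # largest radius that stays inside the image
    out = []
    for i in range(n_radii):
        r = r_max ** (i / (n_radii - 1))        # radii grow geometrically from 1 to r_max
        row = []
        for j in range(n_angles):
            theta = 2 * math.pi * j / n_angles
            x = int(round(centre + r * math.cos(theta)))
            y = int(round(centre + r * math.sin(theta)))
            row.append(image[y][x])
        out.append(row)
    return out


# Radially symmetric test image: each pixel holds its distance to the centre,
# so every log-polar row is approximately constant along the angle axis.
size = 65
c = (size - 1) / 2.0
image = [[math.hypot(x - c, y - c) for x in range(size)] for y in range(size)]
mapped = log_polar(image, n_radii=16, n_angles=32)
```

In this representation a rotation of the image becomes a circular shift along the angle index j, and a uniform scaling becomes a shift along the radius index i, so a subsequent 2D Fourier transform plays the same role as in the audio case.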

Disclosed is a method and arrangement for extracting a fingerprint from a
multimedia signal, particularly an audio signal, which is invariant with respect to speed
changes of the audio signal. To this end, the method comprises extracting (12,13) from the
multimedia signal a set of robust perceptual features, for example, the power spectrum of the
audio signal. A Fourier-Mellin transform (15) converts the power spectrum into Fourier
coefficients that undergo a phase change only, if the audio playback speed changes. Their
magnitudes or phase differences (16) constitute a speed change invariant fingerprint. By a
thresholding operation (19), the fingerprint can be represented by a compact number of bits.

CLAIMS:



1. A method of extracting a fingerprint from a multimedia signal, comprising the
steps of:

- extracting from the multimedia signal a set of robust perceptual features;

- subjecting the extracted set of features to a Fourier-Mellin transform;

- converting the transformed set of features into a sequence constituting the fingerprint.

2. A method as claimed in claim 1, wherein said converting step includes
converting the magnitudes of the Fourier-Mellin transform.

3. A method as claimed in claim 1, wherein said converting step includes
converting the derivative of the phase of the Fourier-Mellin transform.

4. A method as claimed in claim 1, wherein the multimedia signal is an audio
signal and said Fourier-Mellin transform includes a one-dimensional log mapping process
being applied to the set of perceptual features.

5. A method as claimed in claim 1, wherein the multimedia signal is an image or
video signal and said Fourier-Mellin transform includes a two-dimensional log-polar
mapping process being applied to the set of perceptual features.

6. A method as claimed in claim 1, wherein the multimedia signal is an image or
video signal and said Fourier-Mellin transform includes a two-dimensional log-log mapping
process being applied to the set of perceptual features.

7. A method as claimed in claim 1, wherein said extracting step includes
normalization of the set of perceptual features.

8. An apparatus for extracting a fingerprint from a multimedia signal,
comprising:

- means for extracting from the multimedia signal a set of robust perceptual features;

- means for subjecting the extracted set of features to a Fourier-Mellin transform;

- means for converting the transformed set of features into a sequence constituting the
fingerprint.

ABSTRACT:



Disclosed is a method and arrangement for extracting a fingerprint from a
multimedia signal, particularly an audio signal, which is invariant with respect to speed
changes of the audio signal. To this end, the method comprises extracting (12,13) from the
multimedia signal a set of robust perceptual features, for example, the power spectrum of the
audio signal. A Fourier-Mellin transform (15) converts the power spectrum into Fourier
coefficients that undergo a phase change only, if the audio playback speed changes. Their
magnitudes or phase differences (16) constitute a speed change invariant fingerprint. By a
thresholding operation (19), the fingerprint can be represented by a compact number of bits.



Fig. 1



[FIG. 1 (sheet 1/2): block diagram. Audio -> framing (11) -> FFT (12) -> power spectrum (13) -> Norm (14) -> Log map (151) -> FFT (152) -> magnitudes / phase differences (16) -> Select (17) -> (2D) Filter (18) -> Threshold (19) -> FP(n)]



[FIG. 2 (sheet 2/2): power spectra 21 (normal speed) and 22 (increased speed) on a linear frequency axis from 300 to 3,000 Hz]



[FIG. 3 (sheet 2/2): log mapped power spectra 31 (normal speed) and 32 (increased speed) on a logarithmic frequency axis from 300 to 3,000 Hz]



